
Context

In this research, a dataset containing supply chain data from an online store is examined. First, the data is preprocessed and cleaned. Then, exploratory analysis is performed on the data. Finally, machine learning algorithms are used to make two types of predictions: first, the sales amount of each order; second, the risk that an order is not delivered on time.

To what extent are machine learning algorithms effective in predicting order sales and the risk of late delivery?

What factors affect the amount of order sales?

Is there a special relationship and correlation between the features in the dataset?

The Features of this dataset are as follows:

Features Description

Type : Type of transaction made

Days for shipping (real) : Actual shipping days of the purchased product

Days for shipment (scheduled) : Days of scheduled delivery of the purchased product

Benefit per order : Earnings per order placed

Sales per customer : Total sales made per customer

Delivery Status : Delivery status of orders: Advance shipping , Late delivery , Shipping canceled , Shipping on time

Late_delivery_risk : Categorical variable that indicates whether the delivery is late (1) or not late (0)

Category Id : Product category code

Category Name : Description of the product category

Customer City : City where the customer made the purchase

Customer Country : Country where the customer made the purchase

Customer Email : Customer's email

Customer Fname : Customer first name

Customer Id : Customer ID

Customer Lname : Customer last name

Customer Password : Masked customer key

Customer Segment : Types of Customers: Consumer , Corporate , Home Office

Customer State : State to which the store where the purchase is registered belongs

Customer Street : Street to which the store where the purchase is registered belongs

Customer Zipcode : Customer Zipcode

Department Id : Department code of store

Department Name : Department name of store

Latitude : Latitude corresponding to location of store

Longitude : Longitude corresponding to location of store

Market : Market to where the order is delivered : Africa , Europe , LATAM , Pacific Asia , USCA

Order City : Destination city of the order

Order Country : Destination country of the order

Order Customer Id : Customer order code

Order date (DateOrders) : Date on which the order is made

Order Id : Order code

Order Item Cardprod Id : Product code generated through the RFID reader

Order Item Discount : Order item discount value

Order Item Discount Rate : Order item discount percentage

Order Item Id : Order item code

Order Item Product Price : Price of products without discount

Order Item Profit Ratio : Order Item Profit Ratio

Order Item Quantity : Number of products per order

Sales : Value in sales

Order Item Total : Total amount per order

Order Profit Per Order : Order Profit Per Order

Order Region : Region of the world where the order is delivered : Southeast Asia , South Asia , Oceania , Eastern Asia , West Asia , West of USA , US Center , West Africa , Central Africa , North Africa , Western Europe , Northern , Caribbean , South America , East Africa , Southern Europe , East of USA , Canada , Southern Africa , Central Asia , Europe , Central America , Eastern Europe , South of USA

Order State : State of the region where the order is delivered

Order Status : Order Status : COMPLETE , PENDING , CLOSED , PENDING_PAYMENT , CANCELED , PROCESSING , SUSPECTED_FRAUD , ON_HOLD , PAYMENT_REVIEW

Product Card Id : Product code

Product Category Id : Product category code

Product Description : Product Description

Product Image : Link of visit and purchase of the product

Product Name : Product Name

Product Price : Product Price

Product Status : Status of the product stock : 1 if the product is not available , 0 if the product is available

Shipping date (DateOrders) : Exact date and time of shipment

Shipping Mode : The following shipping modes are presented : Standard Class , First Class , Second Class , Same Day

Import Libraries For Overview & EDA

Get The Dataset

Overview Of The Dataset

Data Cleaning

Columns

Type

Days for shipping (real)

Days for shipment (scheduled)

Benefit per order

Sales per customer

Delivery Status

Late_delivery_risk

Category Id

Category Name

Customer City

Customer Country

Customer Email

Customer Fname

Customer Id

Customer Lname

Customer Password

Customer Segment

Customer State

Customer Street

Customer Zipcode

Department Id

Department Name

Latitude

Longitude

Market

Order City

Order Country

Order Customer Id

Order date (DateOrders)

Order Id

Order Item Cardprod Id

Order Item Discount

Order Item Discount Rate

Order Item Id

Order Item Product Price

Order Item Profit Ratio

Order Item Quantity

Sales

Order Item Total

Order Profit Per Order

Order Region

Order State

Order Status

Order Zipcode

Product Card Id

Product Category Id

Product Description

Product Image

Product Name

Product Price

Product Status

Shipping date (DateOrders)

Shipping Mode

Drop Non-Important Columns

Drop Missing Data

EDA: Exploratory Data Analysis

Auto EDA with Sweetviz

Association Rule Mining: Apriori

Build Dataset For Predicting

Hash Encoding
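The notebook's own encoding code is not shown here. As a rough illustration of the idea behind hash encoding, the sketch below maps categorical values into a fixed number of buckets with a stable hash. The helper names `bucket_index` and `hash_encode_row`, the bucket count of 8, and the choice of MD5 are all assumptions made for this example, not the author's implementation.

```python
import hashlib


def bucket_index(key, n_buckets=8):
    """Map a string key to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode_row(row, columns, n_buckets=8):
    """Encode the chosen categorical columns of one row as bucket counts."""
    vector = [0] * n_buckets
    for col in columns:
        # Prefix with the column name so equal values in different
        # columns are hashed as different keys.
        key = f"{col}={row[col]}"
        vector[bucket_index(key, n_buckets)] += 1
    return vector


# Hypothetical row using category values listed in the feature description.
row = {"Shipping Mode": "First Class", "Market": "Europe"}
encoded = hash_encode_row(row, ["Shipping Mode", "Market"])
print(encoded)
```

The output width stays fixed no matter how many distinct categories appear, which is the main appeal for high-cardinality columns such as cities. The trade-off is that unrelated values can collide in the same bucket.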

Train Valid Test Split

Train Dataset:

Set of data used for learning, that is, to fit the parameters of the machine learning model.

Valid Dataset:

Set of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning model hyperparameters. It also plays a role in other forms of model preparation, such as feature selection and threshold cut-off selection.

Test Dataset:

Set of data used to provide an unbiased evaluation of a final model fitted on the training dataset.
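The three subsets described above can be produced by shuffling the rows once and slicing. The following is a minimal, library-free sketch. The function name `train_valid_test_split`, the 70/15/15 proportions, and the seed are assumptions for illustration, and the notebook itself may have used different proportions or a library helper.

```python
import random


def train_valid_test_split(rows, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows once, then cut them into train, valid and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible

    n_test = round(len(rows) * test_frac)
    n_valid = round(len(rows) * valid_frac)

    test = rows[:n_test]
    valid = rows[n_test:n_test + n_valid]
    train = rows[n_test + n_valid:]
    return train, valid, test


# Stand-in for 100 order rows, identified by their index.
train, valid, test = train_valid_test_split(range(100))
print(len(train), len(valid), len(test))
```

Hyperparameters are tuned against `valid` only, and `test` is touched once at the end, so that the final score is not influenced by tuning decisions.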

Sale Prediction

Linear Regression

Evaluation

Mean Absolute Error: (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error: (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error: (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
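The three error metrics above translate directly into a few lines of plain Python. This is a small sketch on made-up numbers rather than the notebook's own evaluation code, and the function names are chosen for this example only.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute errors |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of the squared errors (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Illustrative values only; the errors are 1, 0, -2 and 0.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)        # (1 + 0 + 2 + 0) / 4 = 0.75
mse = mean_squared_error(y_true, y_pred)         # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = root_mean_squared_error(y_true, y_pred)   # sqrt(1.25)
print(mae, mse, rmse)
```

RMSE is expressed in the same units as the target, which makes it easier to read next to MAE. Because of the squaring, both MSE and RMSE weight large errors more heavily than MAE does.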

R² Score: The R² score, also known as the coefficient of determination, measures how well a model fits a given dataset. It indicates how close the predicted values are to the actual values, relative to always predicting the mean $\bar{y}$:

$$1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}$$
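R² compares the sum of squared residuals with the total sum of squares around the mean of the actual values. A minimal plain-Python sketch on illustrative numbers, with a function name chosen for this example:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# Illustrative values only: mean is 6, SS_tot = 9 + 1 + 1 + 9 = 20, SS_res = 5.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

r2 = r2_score(y_true, y_pred)
print(r2)  # 1 - 5/20 = 0.75
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.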

Adjusted R² Score: Adjusted R² is a modified form of R² that penalizes the addition of new independent variables (predictors). It increases only if a new predictor improves the model more than would be expected by chance:

$$1-(1-R^2)\,\frac{n-1}{n-k-1}$$

R² : The R² score of the model

n : Number of Samples in our Dataset

k : Number of Predictors
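To see the penalty at work, the adjusted R² formula can be evaluated for the same R² with different numbers of predictors. The sketch below uses made-up values for R², n and k. The function name `adjusted_r2` is an assumption for this example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n samples and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for adjusted R² to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same raw R² of 0.9 on 101 samples, with 10 versus 50 predictors.
few_predictors = adjusted_r2(0.9, n=101, k=10)   # 1 - 0.1 * 100 / 90
many_predictors = adjusted_r2(0.9, n=101, k=50)  # 1 - 0.1 * 100 / 50 = 0.8
print(few_predictors, many_predictors)
```

With the raw R² held fixed, the model that needs more predictors receives the lower adjusted score. This is why adjusted R² is the fairer figure when comparing models with different feature counts.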

Ridge Regression

Evaluation

Lasso Regression

Evaluation

Decision Tree Regression

Evaluation

Random Forest Regression

Evaluation

Bayesian Regression

Evaluation

Test Models

1. Linear Regression

Evaluation

2. Ridge Regression

Evaluation

3. Lasso Regression

Evaluation

4. Decision Tree Regression

Evaluation

5. Random Forest Regression

Evaluation

6. Bayesian Regression

Evaluation

Comparison Of Regression Algorithms To Predict Sales Of Each Order

Late Delivery Risk Prediction

Decision Trees

Random Forest

Bagging Classifier Algorithm

Gradient Boosting Classifier

XGBoost Classifier

Test Models

1. Decision Trees

2. Random Forest

3. Bagging Classifier

4. Gradient Boosting Classifier

5. XGBoost Classifier

Comparison Of Classification Algorithms To Predict The Risk Of On-Time Delivery Of Each Order

END =)